DAOS-16936 dtx: rank range based DTX hints for coll_punch #15979
base: master
Ticket title is 'Aurora: assertion with collective punch running mdtest with 2048 ECBS'
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15979/1/testReport/
d30dc42
to
6a4b687
Compare
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15979/2/testReport/
6a4b687
to
15e44d6
Compare
I think the PR should re-enable collective punch by default if the fix is for it;
otherwise no test will run with collective punch in CI anyway.
Yes, collective punch has already been enabled on master, but is disabled by default on release/2.6. This patch is for master, so there is no need to change that.
https://github.com/daos-stack/daos/actions/runs/13547004393/job/37861024957?pr=15979
@daltonbohning ,
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15979/3/execution/node/1510/log
Failed for DAOS-17094.
The hook will format for you.
For cases where it sucks, you can add comments /* clang-format off */
clang-format actually does a very good job of maintaining consistency in most cases, but you do have the option to disable it in certain sections of code. Installing the githooks will save you a lot of hassle otherwise.
To answer the other question, the format is controlled by the .clang-format file in the daos root tree.
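As a small illustration of the opt-out mentioned above, here is a sketch that wraps a hand-aligned table in `/* clang-format off */` and `/* clang-format on */` markers. The table `identity3` and the helper `trace3` are hypothetical names made up for this example; they are not from the DAOS tree.

```c
#include <stddef.h>

/* Hypothetical hand-aligned table; the markers keep clang-format from reflowing it. */
/* clang-format off */
static const int identity3[3][3] = {
	{ 1, 0, 0 },
	{ 0, 1, 0 },
	{ 0, 0, 1 },
};
/* clang-format on */

/* Formatting is applied as usual again from here on. */
static int
trace3(const int m[3][3])
{
	int    sum = 0;
	size_t i;

	/* Sum the diagonal entries. */
	for (i = 0; i < 3; i++)
		sum += m[i][i];
	return sum;
}
```

Keeping the disabled region as small as possible means the rest of the file still gets the consistent formatting from the hook.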
15e44d6
to
b279dd6
Compare
Thanks @daltonbohning @jolivier23
When handling a collective punch RPC, the leader generates a hints array for the related DTX RPC (abort or commit). Each involved engine has one element in this array. To simplify the RPC logic, the hints array is sparse and indexed by the engine's rank#. Originally, we did not properly handle the case of non-contiguous rank# in the pool map: when updating a hints element with a large rank#, the write could go out of bounds, overwrite memory owned by others, and cause various kinds of DRAM corruption.

A similar situation can happen when handling collective DTX resync and cleanup.

This patch fixes the issue by building the sparse hints array based on the related engines' rank# range instead of the rank count in the pool map, and by using the relative rank difference (real rank# - base rank#) as the index. That avoids out-of-bounds access.

Signed-off-by: Fan Yong <[email protected]>
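The indexing scheme described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration under assumptions, not the code from this patch: `struct hint_array` and the `hint_array_*` helpers are hypothetical names, and the real DAOS structures and allocation macros differ.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the sparse hints array. */
struct hint_array {
	uint32_t  min_rank; /* base rank#: smallest rank among involved engines */
	uint32_t  max_rank; /* largest rank among involved engines */
	uint8_t  *hints;    /* (max_rank - min_rank + 1) entries, zero-initialized */
};

/* Size the array by the rank# range, not by the number of ranks in the pool map. */
static int
hint_array_init(struct hint_array *ha, const uint32_t *ranks, size_t nr)
{
	size_t i;

	if (nr == 0)
		return -1;

	ha->min_rank = ha->max_rank = ranks[0];
	for (i = 1; i < nr; i++) {
		if (ranks[i] < ha->min_rank)
			ha->min_rank = ranks[i];
		if (ranks[i] > ha->max_rank)
			ha->max_rank = ranks[i];
	}

	ha->hints = calloc((size_t)(ha->max_rank - ha->min_rank) + 1, sizeof(*ha->hints));
	return ha->hints != NULL ? 0 : -1;
}

/* Number of slots in the sparse array. */
static size_t
hint_array_size(const struct hint_array *ha)
{
	return (size_t)(ha->max_rank - ha->min_rank) + 1;
}

/* Index with the relative difference (real rank# - base rank#), with a bounds check. */
static int
hint_array_set(struct hint_array *ha, uint32_t rank, uint8_t hint)
{
	if (rank < ha->min_rank || rank > ha->max_rank)
		return -1;

	ha->hints[rank - ha->min_rank] = hint;
	return 0;
}

static void
hint_array_fini(struct hint_array *ha)
{
	free(ha->hints);
	ha->hints = NULL;
}
```

With involved ranks such as 5, 9 and 1030, an array sized by the rank count (3) would be overrun by any write indexed with rank 1030, whereas the range-based array has 1026 slots and rank 1030 maps to index 1025.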
b279dd6
to
20c76c8
Compare